L1 Cache 
16k / 16k 


L2 Cache 

(512 KB) 


440bx 


32 bytes / cache line 
CPU can access MM @ 800 MB/sec 
CPU can access L2 cache @ > 1.3 GB/sec 

4 CPU clocks to access L2 
39 CPU clocks to access MM 
Original Celerons don’t have L2 cache 

Supercomputers don’t have cache memory, either 

Worry about making your L2 access better 
Don’t worry about the L1 

there’s really nothing you can do there 




Main 

Memory 









Memory Type Range Registers (MTRRs) 

Controls types of all memory spaces 

Write-Back (WB) 




Used by most of MM 
Cache lookups. Line reads 
MESI line transitions 






Modified, Exclusive, Shared, Invalid 

Uncacheable (UC) 

No cache lookup 
Reads not turned into line reads 






Posted writes. No speculative reads 
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Getting data from MM is slow 
Use data from cache for best performance 
Experimental data 

- 4x3 matrix transform of (n = 2000) vertices 

- 5 different data structures 

- vTune 3.x pause/resume API 

• Data Memory References (all) 

• L2 Cache Request Misses (highly correlated) 

• L2 Cache Requests 

- RDTSC (read timestamp counter) 

• Determine elapsed CPU cycles to complete 







struct s1
{
    float x, y, z;
};

class c1
{
public:
    c1(void);
    inline void transform(float ...);
private:
    float x, y, z;
    float vx, vy, vz;
};

struct s2
{
    float x, y, z;
    float nx, ny, nz;
    float tu, tv;
    float r, g, b;
    float sr, sg, sb;
};

struct s3
{
    float x, y, z;
    float vx, vy, vz;
    float nx, ny, nz;
    float tu, tv;
    float r, g, b;
    float sr, sg, sb;
};

struct s4
{
    float *px;
    float *py;
    float *pz;
    float *pvx;
    float *pvy;
    float *pvz;
};
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Experimental Setup (pseudocode) 

void processS1(void)
{
    s1 *psrc = new s1[2000];
    s1 *pdest = new s1[2000];

    increasePriority();
    VtResumeSampling(); // or RDTSC

    do
    {
        // inline matrix multiply
    } while (count < cVertices);

    VtPauseSampling(); // or RDTSC
    resumePriority();
}

for (i = 0; i < 10; i++)
{
    processS1();
    invalidateCache();
}


Test  Event       Samples  Ring 1-3  Ring 0  CPU Cycles
s1    Data Refs    19,752    19,456     296   1,802,704
      L2 Misses        14         4      10
      L2 Reqs       7,954     7,501     453
s1a   Data Refs    19,943    19,724     219   1,803,839
      L2 Misses         3         3       0
      L2 Reqs       6,818     6,743      75
c1    Data Refs    19,304    19,150     154   2,297,085
      L2 Misses        65         7      58
      L2 Reqs       7,901     7,530     371
s2    Data Refs    20,311    20,027     284   1,987,027
      L2 Misses        24        21       3
      L2 Reqs      12,936    12,574     362
s4    Data Refs    20,389    20,042     347   2,256,448
      L2 Misses        87        81       6
      L2 Reqs       7,964     7,518     446
s3    Data Refs    20,196    20,034     162   1,992,929
      L2 Misses       452       452       0
      L2 Reqs      15,689    15,014     675
void processS1(void)
{
    s1 *psrc = new s1[2000];
    s1 *pdest = new s1[2000];
    DWORD count = 0;            // one more variable
    s1 *p = psrc;
    s1 *d = pdest;

    do
    {
        // inline matrix multiply
        p++;
        d++;
        count++;
    } while (count < cVertices);
}

void processS1a(void)
{
    s1 *psrc = new s1[2000];
    s1 *pdest = new s1[2000];
    s1 *p = psrc + 2000 - 1;
    s1 *d = pdest + 2000 - 1;

    do
    {
        // inline matrix multiply
        p--;
        d--;
    } while (p >= psrc);
}
struct s3
{
    float x, y, z;
    float vx, vy, vz;
    float nx, ny, nz;
    float tu, tv;
    float r, g, b;
    float sr, sg, sb;
};

struct Vertex   { float x, y, z; };
struct TVertex  { float vx, vy, vz; };
struct Normal   { float nx, ny, nz; };
struct Color    { float r, g, b; };
struct Specular { float sr, sg, sb; };
struct TxCoord  { float tu, tv; };

struct Object
{
    Vertex   *pvertices;
    TVertex  *ptvertices;
    Normal   *pnormals;
    TxCoord  *ptxcoords;
    Color    *pcolors;
    Specular *pspeculars;
};

You don’t need all the data in a vertex all the time! 
struct Vertex
{
    float x, y, z;
};

• Like data is already batched together 

• Processing only a single piece of data at a time wastes the cache
  improvement made 

  - e.g., transforming a single vertex 

• Perform all of a single task before moving on to the next task 
[Figure: processing one vertex at a time through each stage (light, project,
and so on) takes a cache miss at every step]
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What 





Main Memory 


CPU 


DMA 



Chipset/ 
Memory 
Controller 

Graphics Controller 


NIC 




MMIO 








Memory Mapped IO (MMIO) 

CPU can read/write mapped memory on device 
CPU pushes/pulls data to/from device 

Direct Memory Access (DMA) 

PCI device can read/write main memory 

CPU doesn’t need to be involved 

Coherent - MC ensures relevant CPU caches are written before device reads 
33 MHz for consumer, 66 MHz (aka “Fast PCI”) for server / workstations 

Bandwidth @ 33 MHz 

1 DWORD (4 bytes) per clock, 132 MB/sec 



























Only 1 device allowed per AGP “port” (ie, point to point bus) 

Step 2. Transfer 2 DWORDs per clock 

sample data on rising edge of clock and falling edge 
528 MB/sec bandwidth 
AGP 2x Mode 





























Main Memory 

NLVM 


•Textures 
Commands 
Data 



Main memory reserved by AGP 

AGP can request all physical memory except for last 12 MB 

NLVM - non-local video memory 

Graphics Address Relocation Table (GART) in MC 

scatter/gather mechanism 

maps contiguous, logical gc addresses to non-contiguous physical addresses 

Cannot be paged to disk (virtual memory) 

Uncacheable Speculative (read) Write Combined memory (WC) 

Non-coherent memory, not CPU cached, not snooped 
Set by CPU Memory Type Range Registers (MTRR) MSR 
DMA only works on coherent memory 























Non-Coherent MM Access 


aeft dfiUi 


Chipset/ 

MC 



Graphics 

Controller 



Pipeline Mode (PIPE) 


Shared address and data line 


GC can either request data or receive data 



think of it like half-duplex 


GC can have multiple outstanding requests 










Chipset/ 

MC 


Ad^rettLIs 


Dfds 


to R^cciv^Dflto 


Graphics 

Controller 


Side-bands (SB) 

Separate address 


data lines 




can simultaneously request and receive 

think of it like full-duplex 

can have multiple outstanding requests 

Maximizing AGP Performance 

http://www.agpforum.org 





















So 





Choices 




A good AGP platform has: 

High performing chipset/MC 
High performing GC 



Performance will be bounded by lower 
performing component 


Transfer Mode 

MMIO 

DMA/Fast Pci/Frame 

Pipelined 
. Sideband 


AGP 1x 

X 

X 

• X 

X 


AGP 2x 


DMX 


Coherent 


non-Coherent 


depends on MTRRs 


X 

X 


X 


X 

Y 





X 

X 



GC can mix-and-match transfer methods 


eg 


MMIO for commands 

DMA triangles from MM 

Sideband textures from MM using DMX 



Note 



This means that different data will be 
bounded by different bandwidth rates 
Is there a way to test throughput between MM, NLVM, LVM? 
One way is by copying 
DirectDraw is really a heap manager 

Allocate 2 surfaces of each memory type 

From \ To    MM    NLVM    LVM
MM
NLVM
LVM
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Labs Perm 





^4l 




(feiva 

gldd32.<ll 

product 


dcsdiplion 
Diaroond HRE GL 




PRO 


4 - 


vcr 


stfbva 


vendor device subsys 


rev 


2359 4172 


15623 22286482 


Intel i740 


driver 


description 


R3Dd32Kl.dU SUffirfrter AGP Release 0322, Standard 


product 


va 


subva 


vendor device subsys 
32902 30720 524349 


rev 


nVidia Riva TNT 


driver 


description 




NV4DD32 DLL NV idia RTV A TNT 


product 


ver 


subver 


vendor device subsys 


rev 




4318 


659099828 4 


















Test Loop 

for (DWORD w = dwwidth, h = dwheight;
     (w < (dwwidth + dwbytes)) && (h < (dwheight + dwbytes));
     w += 64, h += 64)
  for (DWORD dwqueuedepth = dwqueuemin; dwqueuedepth < dwqueuemax; dwqueuedepth++)
    for (DWORD iter = 0; iter < dwiterations; iter++)
    {
        // increase priority

        GetTimeStamp(&qstart);                      // Start Blts

        for (DWORD q = 0; q < dwqueuedepth; q++)    // Create draw list
            hrblt = pdest->BltFast(0, 0, psrc, &rect, DDBLTFAST_WAIT);

        GetTimeStamp(&qguarantee);                  // Submission Completed

        pdest->Lock(0, &ddsd, DDLOCK_SURFACEMEMORYPTR | DDLOCK_WAIT, 0);
        pdest->Unlock(0);

        GetTimeStamp(&qend);                        // Blts Actually Complete

        // decrease priority
    }
// decrease priority 
[Chart: nVidia Riva TNT, LVM->LVM throughput]

Guaranteed throughput increases until 8th blt, then essentially holds 

Completed throughput peaks at 565 MB/sec 

High concurrency 
9. Local Video Memory Blt Performance 

The guaranteed performance of the Permedia 2 continues to increase until the 
33rd blt is queued, then performance drops precipitously. This could be due to a 
number of issues, such as waiting for the blt queue to clear. The completed 
throughput peaks at ~190 MB/sec. 

[Chart: 3D Labs Permedia 2, LVM->LVM, MB/sec]

3D Labs Permedia 2, LVM->LVM, 72x72 pixels, 1 - 64 blts 

The TNT does not show the same type of performance degradation as more blts 
are added. Performance increases until the 8th blt, and then levels off. The 
completed throughput peaks at ~565 MB/sec. 
10. Main Memory to LVM and NLVM Performance 

The performance of transfers to LVM and NLVM from main memory directly 
impacts application decisions including: How much can I blt each frame? And, 
can I upload new data to the controller during game play? 

In some cases, the bandwidth available for blitting may not be enough 
during each frame. 

Comparative MM->LVM 

A higher performing alternative might be to blt a surface in NLVM, and let the 
graphics controller pull the data from there. 
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Don’t Starve that CPU! 

Making the Best of Memory Bandwidth 


Herbert Marselas 

hmarselas@ensemblestudios.com 

Ensemble Studios 


Abstract 

With the increasing performance of microprocessors in today’s personal 
computers, the ability to move data quickly and efficiently to the main CPU, as 
well as to a hardware accelerated graphics controller, has become more 
important than ever. This paper examines the performance impact of structuring 
data for efficient processing by the main CPU by measuring the performance 
characteristics of several vertex data structure arrangements employed with a 
matrix multiplication algorithm. This paper also examines the performance of 
transferring data in a graphics subsystem, using the Microsoft DirectDraw API 
to characterize actual and guaranteed memory throughput between 
main memory, non-local video memory, and local video memory. 


1. Experimental Setup 

All analysis of data structure performance was performed using an Intel 
Pentium® II processor based platform with the following configuration: 

• 450 MHz processor 

• Intel 440bx chipset 

• 128 MB, 100 MHz, SDRAM 

Data transfer analysis of the hardware accelerated graphics subsystem was 
performed using two Intel Pentium® II processor based platforms with the 
following configuration: 

• System1: 400 MHz processor 

• System2: 450 MHz processor 

• Intel 440bx chipset 

• 256 MB, 100 MHz, SDRAM 


2. Main Memory to the CPU 

The Intel Pentium® II processor incorporates a two level cache system. The 
primary, or L1, cache incorporates separate 16k data and 16k instruction 
memory areas. The secondary, or L2, cache incorporates a unified 512 KB 
memory area. 

Below the L2 cache, the Intel 440bx integrated chipset memory controller 
interfaces to a 100 MHz SDRAM main memory subsystem. Memory is typically 
transferred into the cache from main memory, or back to main memory from the 
cache, in multiples of 32 byte cache lines. 

Peak throughput from main memory is ~800 MB/sec, while peak throughput from 
the L2 cache is ~1.3 GB/sec. Not only is the throughput much lower from main 
memory, but the CPU pays a heavy price in clock cycles for accessing main 
memory. A CPU access to the L2 cache takes ~4 CPU cycles, whereas a CPU 
access to main memory requires ~39 CPU cycles. Since throughput from main 
memory is only a little more than half that of the L2 cache, and the CPU must 
wait nearly ten times as long for the operation to complete, the more that can 
be done to keep data in the L2 cache, the less time will be spent by the CPU 
waiting for data. 

Peak throughput from the L1 cache is even higher than from the L2. However, 
little can be done programmatically to keep code and data in the L1 due to a 
number of factors, including its relatively small size. This paper therefore 
concentrates on ways to keep data in the L2, and reduce the number of main 
memory references required by the CPU. 


The performance of reading from, or writing to, a particular memory address can 
also be affected by the memory type at that address. Five memory types 
are supported by the Pentium® II processor and are configurable via the 
processor’s Memory Type Range Registers (MTRRs). The most common type is 
write-back (WB) memory. WB memory reads 32 byte cache lines from main 
memory and uses the cache for data lookup. 

Write-combined (WC, USWC) memory does not perform cache lookups, 
reading directly from the memory source, but any writes to a single WC line are 
combined in an internal CPU write-combining buffer. Because writes to a 
WC memory line are combined, multiple writes to the same address will be lost 
if the line is not evicted between writes to the same memory address. In WC 
memory, reads can also pass writes, which means that reading from WC memory 
may, or may not, return the correct result. This is called non-coherent memory. 
The use of WC memory causes a severe read penalty, but significantly increases 
the write performance to a memory region. The WC memory type is used for 
memory areas where many writes occur, but few reads occur. Non-local video 
memory (also known as AGP memory), as well as the frame buffer of most 
graphics controllers, are marked WC. 

The other memory types, uncacheable (UC), write-through (WT), and write 
protect (WP), are encountered less frequently. 


3. Structuring Data for Higher Performance 


In order to characterize the impact of data layout on processor performance, a 
test case was created using five layouts of a vertex data structure which was 
used in a standard 4x3 matrix multiplication. 2000 of each data structure were 
allocated, then a loop was utilized to multiply each vertex by a 4x3 matrix, saving 
the result in the same structure, or in another structure. 

Two sets of tests were run for each structure. The first test utilized the Intel 
vTune Pause and Resume API to drive data gathering from the 
CPU performance counters. Three types of data were collected using the CPU 
performance counters: 

• Data memory references (all) 

• L2 cache request misses (highly correlated) 

• L2 cache requests 

This data allows us to determine how well the cache was utilized. The 
second test used the CPU Timestamp Counter (TSC) to determine the 
amount of time each test took to complete. Both tests were executed 10 times 
for each structure, and the best times were taken. 

The data structures themselves were chosen based on the types of structures 
I have implemented directly, or have encountered in other implementations. 


struct s1
{
    float x, y, z;
};

struct s2
{
    float x, y, z;
    float nx, ny, nz;
    float tu, tv;
    float r, g, b;
    float sr, sg, sb;
};

struct s3
{
    float x, y, z;
    float vx, vy, vz;
    float nx, ny, nz;
    float tu, tv;
    float r, g, b;
    float sr, sg, sb;
};

struct s4
{
    float *px;
    float *py;
    float *pz;
    float *pvx;
    float *pvy;
    float *pvz;
};

class c1
{
public:
    c1(void);
private:
    float x, y, z;
    float vx, vy, vz;
};



Each structure was used as part of a standardized loop: 

void processS1(void)
{
    s1 *psrc = new s1[2000];
    s1 *pdest = new s1[2000];

    increasePriority();
    VtResumeSampling(); // or RDTSC

    do
    {
        // inline matrix multiply
    } while (count < cVertices);

    VtPauseSampling(); // or RDTSC
    resumePriority();
}

In order to reduce the number of system effects, the thread and process priority 
were increased prior to running each test, and reduced after each test. In order 
to reduce the problem of cache effects between tests, new memory was 

allocated at the beginning of each test, and a large memory area was read into 
the cache between each test. 


for (i = 0; i < 10; i++)
{
    processS1();
    invalidateCache();
}
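The repeat-and-keep-the-best procedure described above can be sketched in
portable C++. This is only an illustration under stated assumptions:
std::chrono::steady_clock stands in for RDTSC, and bestOf and
touchLargeBuffer are hypothetical names, the latter standing in for the
large read that disturbs the cache between tests.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for invalidateCache(): read one byte from every
// 32 byte line of a buffer so that earlier data is pushed out of the cache.
inline std::uint64_t touchLargeBuffer(const std::vector<std::uint8_t>& buf)
{
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < buf.size(); i += 32)
        sum += buf[i];
    return sum;
}

// Run `test` `runs` times and return the shortest elapsed time in nanoseconds.
template <typename Fn>
std::int64_t bestOf(int runs, Fn test)
{
    // 2 MB is assumed to be comfortably larger than the 512 KB L2 cache.
    static const std::vector<std::uint8_t> flush(2 * 1024 * 1024, 1);
    volatile std::uint64_t sink = 0;   // keeps the flushing reads from being optimized away
    std::int64_t best = std::numeric_limits<std::int64_t>::max();
    for (int i = 0; i < runs; ++i)
    {
        const auto start = std::chrono::steady_clock::now();
        test();
        const auto stop = std::chrono::steady_clock::now();
        const std::int64_t ns =
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
        if (ns < best)
            best = ns;
        sink = sink + touchLargeBuffer(flush);
    }
    return best;
}
```

Taking the minimum rather than the mean matches the paper's choice of best
times, since interference from the rest of the system can only make a run
slower.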


4. Performance Results 

The results demonstrated that the vertex structure containing only data relevant 
to the matrix multiplication performed significantly better than the other 
structures. 

Test  Event       Samples  Ring 1-3  Ring 0  CPU Cycles
s1    Data Refs    19,752    19,456     296   1,802,704
      L2 Misses        14         4      10
      L2 Reqs       7,954     7,501     453
s1a   Data Refs    19,943    19,724     219   1,803,839
      L2 Misses         3         3       0
      L2 Reqs       6,818     6,743      75
c1    Data Refs    19,304    19,150     154   2,297,085
      L2 Misses        65         7      58
      L2 Reqs       7,901     7,530     371
s2    Data Refs    20,311    20,027     284   1,987,027
      L2 Misses        24        21       3
      L2 Reqs      12,936    12,574     362
s4    Data Refs    20,389    20,042     347   2,256,448
      L2 Misses        87        81       6
      L2 Reqs       7,964     7,518     446
s3    Data Refs    20,196    20,034     162   1,992,929
      L2 Misses       452       452       0
      L2 Reqs      15,689    15,014     675


Even though the total number of data references is nearly the same between the 
best and worst cases, the number of L2 requests is twice as high in the worst 
case (s3) as in the best case (s1). Also, there are over one hundred 
times as many cache misses in the worst case (s3) as compared to the best case 
(s1a). 

For some cases, even though the cache performance improved, the actual 
execution time increased (e.g., s4). This may be due to other factors which 
were not taken into account in these tests. It should be remembered that these 
are best case numbers since the thread and process priority were increased in 
order to obtain more reliable numbers. In an actual implementation, 
performance will probably be lower. 
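One way to see why the layouts behave so differently is to count how many
32 byte cache lines an array of 2000 vertices spans in each layout. The
sketch below assumes 4 byte floats with no padding and a line-aligned
array; linesTouched, kCacheLine and kVertices are hypothetical names used
only for this illustration.

```cpp
#include <cstddef>

struct s1 { float x, y, z; };
struct s2 { float x, y, z; float nx, ny, nz; float tu, tv;
            float r, g, b; float sr, sg, sb; };
struct s3 { float x, y, z; float vx, vy, vz; float nx, ny, nz; float tu, tv;
            float r, g, b; float sr, sg, sb; };

constexpr std::size_t kCacheLine = 32;    // bytes per cache line, as stated in section 2
constexpr std::size_t kVertices  = 2000;  // vertices per test

// Cache lines spanned by a contiguous, line-aligned array of `count` elements.
constexpr std::size_t linesTouched(std::size_t elemSize, std::size_t count)
{
    return (elemSize * count + kCacheLine - 1) / kCacheLine;
}

static_assert(sizeof(s1) == 12, "assumes 4 byte floats and no padding");
static_assert(sizeof(s2) == 56, "assumes 4 byte floats and no padding");
static_assert(sizeof(s3) == 68, "assumes 4 byte floats and no padding");
```

Under these assumptions the s1 array spans 750 lines, s2 spans 3,500 and s3
spans 4,250, so a pass over s3 pulls in more than five times as much memory
as a pass over s1 to reach the same x, y, z values.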


4.1 Counting Down 

Even though there are five test structures, there are six test cases presented. 
Tests s1 and s1a both utilized the same structure, but varied the implementation 
of the looping structure. 

processS1’s implementation is shown: 

void processS1(void)
{
    s1 *psrc = new s1[2000];
    s1 *pdest = new s1[2000];

    DWORD count = 0;

    s1 *p = psrc;
    s1 *d = pdest;

    do
    {
        // inline matrix multiply
        p++;
        d++;
        count++;
    } while (count < cVertices);
}

Pointer variables track the source and destination memory addresses, and an 
integer counter tracks the number of vertices transformed so far. 

In processS1a, the integer counter variable is removed, and the source and 
destination pointers are started at the end of their respective memory address 
spaces. 

void processS1a(void)
{
    s1 *psrc = new s1[2000];
    s1 *pdest = new s1[2000];

    s1 *p = psrc + 2000 - 1;
    s1 *d = pdest + 2000 - 1;

    do
    {
        // inline matrix multiply
        p--;
        d--;
    } while (p >= psrc);
}

The loop is ended when the source pointer drops below the address of the start 
of the source array. This loop optimization provides another performance gain 
of ~1000 CPU cycles. 
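The two loop forms can be compared in a self-contained sketch. The
transform function and its 4x3 matrix convention (three rows of
rotation/scale followed by a translation row) are assumptions made for
illustration, as are the names processUp and processDown. The
counting-down version decrements before use, so the pointer never moves
below the start of the array.

```cpp
#include <cstddef>
#include <vector>

struct s1 { float x, y, z; };

const std::size_t cVertices = 2000;

// Assumed convention: rows 0..2 rotate/scale, row 3 translates.
inline void transform(const float m[4][3], const s1& in, s1& out)
{
    out.x = in.x * m[0][0] + in.y * m[1][0] + in.z * m[2][0] + m[3][0];
    out.y = in.x * m[0][1] + in.y * m[1][1] + in.z * m[2][1] + m[3][1];
    out.z = in.x * m[0][2] + in.y * m[1][2] + in.z * m[2][2] + m[3][2];
}

// Counting up with a separate counter, in the style of processS1.
void processUp(const float m[4][3], const s1* psrc, s1* pdest)
{
    const s1* p = psrc;
    s1* d = pdest;
    std::size_t count = 0;
    do
    {
        transform(m, *p, *d);
        p++;
        d++;
        count++;
    } while (count < cVertices);
}

// Counting down on the pointer itself, in the style of processS1a.
void processDown(const float m[4][3], const s1* psrc, s1* pdest)
{
    const s1* p = psrc + cVertices;   // one past the last element
    s1* d = pdest + cVertices;
    do
    {
        --p;
        --d;
        transform(m, *p, *d);
    } while (p != psrc);
}
```

Both functions write identical results; the second simply has one less
variable to maintain and compare on each iteration.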
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4.2 Restructuring Vertex Data 


This data demonstrates that a significant performance improvement can be made 
by restructuring data so that only relevant data is utilized during processing. 
Applying this to the remaining data found in a typical vertex structure, the 
following reorganization of data can take place: 



struct s3
{
    float x, y, z;
    float vx, vy, vz;
    float nx, ny, nz;
    float tu, tv;
    float r, g, b;
    float sr, sg, sb;
};

struct Vertex   { float x, y, z; };
struct TVertex  { float vx, vy, vz; };
struct Normal   { float nx, ny, nz; };
struct Color    { float r, g, b; };
struct Specular { float sr, sg, sb; };
struct TxCoord  { float tu, tv; };

struct Object
{
    Vertex   *pvertices;
    TVertex  *ptvertices;
    Normal   *pnormals;
    TxCoord  *ptxcoords;
    Color    *pcolors;
    Specular *pspeculars;
};



An individual vertex-based object can now be described using a structure of 
arrays of like data. This places like data in contiguous memory blocks for faster 
and more efficient cache access. 
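A minimal sketch of a transform pass over such a structure of arrays
follows. It assumes std::vector in place of the raw pointers in Object, and
the same hypothetical 4x3 matrix convention (three rotation/scale rows plus
a translation row); transformAll is an illustrative name, not part of the
original code.

```cpp
#include <cstddef>
#include <vector>

struct Vertex  { float x, y, z; };
struct TVertex { float vx, vy, vz; };
struct Normal  { float nx, ny, nz; };

// Structure of arrays: each attribute lives in its own contiguous block.
struct Object
{
    std::vector<Vertex>  vertices;
    std::vector<TVertex> tvertices;
    std::vector<Normal>  normals;    // never touched by the transform pass
};

// Reads only `vertices` and writes only `tvertices`, so every cache line
// fetched is filled with data the loop actually uses.
void transformAll(Object& obj, const float m[4][3])
{
    obj.tvertices.resize(obj.vertices.size());
    for (std::size_t i = 0; i < obj.vertices.size(); ++i)
    {
        const Vertex& v = obj.vertices[i];
        TVertex& t = obj.tvertices[i];
        t.vx = v.x * m[0][0] + v.y * m[1][0] + v.z * m[2][0] + m[3][0];
        t.vy = v.x * m[0][1] + v.y * m[1][1] + v.z * m[2][1] + m[3][1];
        t.vz = v.x * m[0][2] + v.y * m[1][2] + v.z * m[2][2] + m[3][2];
    }
}
```

Normals, colors and texture coordinates stay out of the cache until a pass
that needs them runs.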


Only during clipping operations, or reformatting for submission to an API, is all 
the data in a vertex required. During other standard operations, only two or three 
pieces of data from the vertex are required. 

Operation            Data Required
Backface cull        normal (to face), vertex
Transform vertices   vertex, transformed vertex
Lighting             color, specular, normal
Projection           transformed vertex, projected vertex
Clipping             everything, if it is clipped;
                     otherwise, only projected vertex





















5. Batching Processing of Data 

Many applications process a single piece of data at a time, e.g. transforming a 
single vertex. With like data batched together into contiguous memory, 
processing only a single piece of data at a time wastes the cache improvement 
that’s been made through this data restructuring. 

struct Vertex
{
    float x, y, z;
};

If only a single vertex were transformed at a time, the performance improvement 
made by restructuring this data would be lost since the improved caching of data 
wouldn’t be taken advantage of. Performing all of one process before moving on 
to the next process can significantly improve throughput to the CPU. However, 
there will be cases, based on the amount of data involved, where piecemeal 
processing will be a performance benefit. 
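Batching by task can be sketched as one loop per stage, each running over
the whole array before the next stage starts. The two stages below (a
uniform scale standing in for the transform, and a clamped N.L diffuse
term) and the names scaleAll and lightAll are hypothetical examples chosen
for brevity.

```cpp
#include <cstddef>
#include <vector>

struct Vertex { float x, y, z; };
struct Normal { float nx, ny, nz; };

// Stage 1 over the whole batch: uniform scale of every position.
void scaleAll(std::vector<Vertex>& vertices, float s)
{
    for (std::size_t i = 0; i < vertices.size(); ++i)
    {
        vertices[i].x *= s;
        vertices[i].y *= s;
        vertices[i].z *= s;
    }
}

// Stage 2 over the whole batch: diffuse term N.L, clamped at zero.
void lightAll(const std::vector<Normal>& normals, const float light[3],
              std::vector<float>& intensity)
{
    intensity.resize(normals.size());
    for (std::size_t i = 0; i < normals.size(); ++i)
    {
        const Normal& n = normals[i];
        const float d = n.nx * light[0] + n.ny * light[1] + n.nz * light[2];
        intensity[i] = d > 0.0f ? d : 0.0f;
    }
}
```

Calling scaleAll and then lightAll keeps each loop streaming through one or
two contiguous arrays, instead of hopping between attribute arrays for
every vertex.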
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6. Memory Throughput in the Graphics Subsystem 

In order to understand the performance impact of the memory types in the 
graphics subsystem, a basic understanding of the memory components and 
transfer mechanisms in the system is required. 


6.1 Main Memory and PCI 


Peripheral Component Interconnect (PCI) devices can map memory that can 
appear to the host memory controller as any other memory space. This system 
is called Memory Mapped I/O (MMIO), and allows the memory controller and 
CPU to read and write memory on a device, as well as configure the memory type 
of MMIO regions using the CPU’s MTRRs. 



In addition to the CPU being able to read or write directly to MMIO on a PCI 
device, the device itself can read or write directly to main memory using Direct 
Memory Access (DMA). A PCI device using DMA to access WB main memory 
addresses acts similarly to another CPU, in that the request is snooped, and all 
access is handled coherently. 


The PCI bus runs at 33 MHz standard, and 66 MHz in Fast PCI mode. At 4 
bytes (DWORD) per clock transferred, peak throughput is 132 MB/sec and 264 

MB/sec respectively. 


6.2 AGP 


At their heart, Accelerated Graphics Port (AGP) devices are an extension of PCI 
devices. Fast PCI is known as frame-based AGP, and is the base AGP protocol. 
However, AGP has several extensions beyond standard PCI that allow it to 
handle more data at a time with lower latency sensitivity. 

In AGP 2x mode, data is transferred on the rising and falling edges of the clock, 
thereby doubling the peak data throughput to 528 MB/sec. 
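The peak figures quoted for PCI and AGP all follow from clock rate times
bytes per transfer times transfers per clock. The helper below is a small
worked example using the nominal 33 and 66 MHz clocks from the text;
peakMBps is a hypothetical name, and 1 MB is taken as 10^6 bytes.

```cpp
// Peak transfer rate in MB/sec (1 MB = 10^6 bytes).
constexpr double peakMBps(double clockMHz, int bytesPerTransfer, int transfersPerClock)
{
    return clockMHz * bytesPerTransfer * transfersPerClock;
}

constexpr double kPci     = peakMBps(33.0, 4, 1);  // standard PCI: one DWORD per clock
constexpr double kFastPci = peakMBps(66.0, 4, 1);  // Fast PCI: same width, doubled clock
constexpr double kAgp2x   = peakMBps(66.0, 4, 2);  // AGP 2x: both clock edges carry data
```

These evaluate to 132, 264 and 528 MB/sec respectively, matching the
figures given above.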





In addition to using DMA to access main memory, specialized non-local video 
memory (NLVM) can be allocated for specific use by AGP. The NLVM is 
memory type WC to the CPU, meaning that it can be written to quickly. This 
NLVM is coordinated on the AGP side by the Graphics Address Relocation Table 
(GART) that acts as a scatter-gather table to map multiple non-contiguous 
segments of main memory, into an apparently contiguous region of NLVM. All of 
main memory except for the last 12 MB can be utilized as NLVM by AGP. 
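The scatter-gather role of the GART can be illustrated with a toy
translation table. This is a sketch only: the 4 KB page size, the Gart
structure and its translate function are assumptions for illustration and
do not describe any particular chipset's register layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kPageSize = 4096;   // assumed 4 KB pages

// Toy GART: entry i holds the physical page number that backs page i of the
// contiguous address range seen by the graphics controller.
struct Gart
{
    std::vector<std::uint32_t> physicalPage;

    // Translate an offset in the contiguous range to a physical address.
    std::uint64_t translate(std::uint32_t offset) const
    {
        const std::uint32_t page   = offset / kPageSize;
        const std::uint32_t within = offset % kPageSize;
        return static_cast<std::uint64_t>(physicalPage.at(page)) * kPageSize + within;
    }
};
```

With entries {7, 2, 9}, three scattered physical pages appear to the
graphics controller as one 12 KB block: offset 0 resolves into page 7, and
offset 4096 into page 2.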


Since WC memory is a non-coherent memory type, and DMA can only reliably 
access coherent memory, a mechanism is required to access NLVM. Two such 
mechanisms exist in the AGP protocol: pipeline and sidebands. Pipeline mode 
incorporates a set of shared data and address lines to access NLVM. Sidebands 
uses separate data and address lines to access NLVM. The selection of pipeline 
or sidebands is purely based on the AGP device’s implementation. 

The final AGP extension is the ability for the AGP device to execute data directly 
out of main memory. Unlike DMA, which requests a copy of the data, Direct 
Memory Execution (DMX, also known as DIME, and a number of other 
non-standard acronyms) allows NLVM to be treated like extended memory by the 
graphics controller. 

Data Transfer Choices 

With this large number of AGP protocol and transfer choices, the situation 
becomes more complex since a single device can use different transfer 
methods depending on the operation the device is performing. 



Transfer Mode 


MMIO 


DMA/Fast Pci/Frame 


Pipelined 


Sideband 


AGP 1x 



X 


X 


X 


AGP 2x 


DMX 


Coherent non-Coherent 


depends on MTRRs 


X 


X 


X 



X 


X 
A graphics device could use MMIO for commands, DMA to get triangle primitives 
from main memory, and sidebands to access textures from NLVM using 
DMX. This means that different types of data will be bounded by different peak 
data rates. 

7. Memory Blt Performance 

One type of data transfer that is used extensively in graphics is the Blt, or bit 
block transfer. Using the Microsoft DirectDraw API Blt and BltFast interfaces, 
memory surfaces were created in main memory (MM), NLVM, and local video 
memory (LVM). The performance of blitting between the surfaces was then timed 
in order to determine the guaranteed time (the time in which the call returned), 
and the actual time (the time it took to complete copying the data from the 
source to the destination surface). Between these three memory areas, there 
resulted nine different test cases. 



Three recent graphics controllers were chosen at random: 3D Labs Permedia 2, 
Intel i740, and the nVidia Riva TNT. The most recent drivers were obtained from 
their respective web sites (see appendix B). 


7.1 Experimental Setup 

A program, bltalyzer.exe, was created and used to characterize blt performance on 
these three graphics controllers (see collateral, appendix A). The width and 
height of square memory areas was varied, as was the number of blts issued 
(queued) to the controller. 

Timing data was taken using the CPU read timestamp counter (RDTSC) 
instruction at three intervals during the test: before the first blt (start time), after 
the blt(s) were issued (submission or guarantee time), and after the blts were 
finally completed (completed time). The IDirectDrawSurface::Lock function was 
used on the destination surface in order to determine when the pending blt(s) 
actually completed. Each test was repeated ten times, and the best timing was 
used for analysis. 


for (DWORD w = dwwidth, h = dwheight;
     (w < (dwwidth + dwbytes)) && (h < (dwheight + dwbytes));
     w += 64, h += 64)
{
    for (DWORD dwqueuedepth = dwqueuemin;
         dwqueuedepth < dwqueuemax; dwqueuedepth++)
    {
        for (DWORD iter = 0; iter < dwiterations; iter++)
        {
            // increase priority

            GetTimeStamp(&qstart);

            // queue the blts; each call returns once the blt is accepted
            for (DWORD q = 0; q < dwqueuedepth; q++)
                hrblt = pdest->BltFast(0, 0, psrc, &rect,
                                       DDBLTFAST_WAIT);

            GetTimeStamp(&qguarantee);

            // Lock waits until the pending blts to pdest have completed
            hr = pdest->Lock(0, &ddsd, DDLOCK_SURFACEMEMORYPTR |
                             DDLOCK_WAIT, 0);

            pdest->Unlock(0);
            GetTimeStamp(&qend);

            // decrease priority
        }
    }
}
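The three timestamps still have to be turned into throughput figures. The sketch below shows one way this might be done, under stated assumptions: the CPU clock rate is supplied by the caller (RDTSC counts clocks, not seconds), 1 MB is taken as 2^20 bytes, and "best timing" is interpreted as the sample with the shortest completed time. The names BltSample, mbPerSec, and bestSample are hypothetical and are not taken from bltalyzer.exe.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One sample: raw timestamp-counter readings at the three points in the text.
struct BltSample {
    std::uint64_t start;      // before the first blt
    std::uint64_t guarantee;  // after all blts were issued
    std::uint64_t end;        // after Lock() returned
};

// MB/sec for `bytes` moved in `cycles` CPU clocks.
// Assumptions: cpuHz is supplied by the caller; 1 MB = 2^20 bytes.
double mbPerSec(std::uint64_t bytes, std::uint64_t cycles, double cpuHz)
{
    if (cycles == 0 || cpuHz <= 0.0) return 0.0;
    const double seconds = static_cast<double>(cycles) / cpuHz;
    return (static_cast<double>(bytes) / (1024.0 * 1024.0)) / seconds;
}

// Keep the sample with the shortest completed time (best of N).
// Precondition: samples must not be empty.
BltSample bestSample(const std::vector<BltSample>& samples)
{
    return *std::min_element(samples.begin(), samples.end(),
        [](const BltSample& a, const BltSample& b) {
            return (a.end - a.start) < (b.end - b.start);
        });
}
```

The guaranteed rate would then use `guarantee - start` as the cycle count and the completed rate would use `end - start`, with the byte count being the total moved by all queued blts.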


7.2 Charts Produced 


This setup of guaranteed time and completed time enabled the generation of two
graphs. The guaranteed time demonstrated how quickly the graphics controller
could accept blts, and the completed time demonstrated how long it actually took
the controller to complete processing the data movement operation.


Typically, the throughput that could be guaranteed based on submission was
significantly higher than the actual throughput achieved when the operation completed.
3D Labs Permedia 2, LVM->LVM, Guaranteed MB/sec
This difference between the guaranteed time and actual completion time is an
expression of the concurrency between the CPU and the graphics controller.
Since the CPU and the graphics controller can execute in parallel, there should
be a large difference between the two graphs. Conversely, the smaller the
difference between the graphs, the less the CPU and graphics controller are
able to operate in parallel.
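One rough way to put a number on this concurrency is the share of the total blt time that remained after the submitting calls had already returned. This metric is an illustration only and is not one the measurements above report; the function name is hypothetical, and the inputs are the same three timestamp readings described in section 7.1.

```cpp
#include <cstdint>

// Fraction of the total time (start..end) that elapsed after submission
// had returned (guarantee..end). Values near 1.0 suggest the CPU was free
// for most of the transfer; values near 0.0 suggest it was blocked.
// Illustrative metric only, not taken from the original analysis.
double cpuFreeFraction(std::uint64_t start, std::uint64_t guarantee,
                       std::uint64_t end)
{
    if (end <= start || guarantee < start || guarantee > end) return 0.0;
    return static_cast<double>(end - guarantee) /
           static_cast<double>(end - start);
}
```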



3D Labs Permedia 2, LVM->LVM, Completed MB/sec


8. Comparative Main Memory to Main Memory Blt

The performance of blting from one main memory surface to another is nearly
identical between the three controllers.
9. Local Video Memory Blt Performance


The guaranteed performance of the Permedia 2 continues to increase until the
33rd blt is queued, then performance drops precipitously. This could be due to a
number of issues, such as waiting for the blt queue to clear. The completed
throughput peaks at ~190 MB/sec.


3D Labs Permedia 2, LVM->LVM, MB/sec

3D Labs Permedia 2, LVM->LVM, 72x72 pixels, 1 - 64 blts


The TNT does not show the same type of performance degradation as more blts

are added. Performance increases until the 5th blt, and then levels off. The
completed throughput peaks at ~565 MB/sec.
The i740, however, operates similarly to the Permedia 2, in that the blt
performance increases until the 8th or 16th blt, then decreases. The completed
peak throughput is ~150 MB/sec.



Intel i740, LVM->LVM
10. Main Memory to LVM and NLVM Performance

The performance of transfers to LVM and NLVM from main memory directly
impacts application decisions, including: How much can I blt each frame? And,
can I upload new data to the controller during game play?
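The first question can be framed as a per-frame byte budget. The sketch below is a back-of-the-envelope calculation under stated assumptions: the sustained transfer rate and the target frame rate are inputs the reader supplies, 1 MB is taken as 2^20 bytes, and the whole frame time is assumed to be available for blting, which will rarely be true in a real game. Both function names are hypothetical.

```cpp
#include <cstdint>

// Bytes that can be moved in one frame at a sustained transfer rate.
// Assumption: 1 MB = 2^20 bytes, and the full frame time is spent blting.
double bytesPerFrame(double mbPerSec, double framesPerSec)
{
    if (mbPerSec <= 0.0 || framesPerSec <= 0.0) return 0.0;
    return mbPerSec * 1024.0 * 1024.0 / framesPerSec;
}

// Number of whole surfaces of the given size that fit in that budget.
std::uint64_t surfacesPerFrame(double mbPerSec, double framesPerSec,
                               std::uint32_t width, std::uint32_t height,
                               std::uint32_t bytesPerPixel)
{
    const double surfaceBytes =
        static_cast<double>(width) * height * bytesPerPixel;
    if (surfaceBytes <= 0.0) return 0;
    return static_cast<std::uint64_t>(
        bytesPerFrame(mbPerSec, framesPerSec) / surfaceBytes);
}
```

For example, with a hypothetical 60 MB/sec sustained rate at 60 frames per second, the budget is 1 MB per frame, or eight 256x256 surfaces at 2 bytes per pixel.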






































































In some cases, the bandwidth available to use for blting may not be sufficient
during each frame.
Comparative MM->LVM




A higher performing alternative might be to blt a surface in NLVM, and let the
graphics controller pull the data from there.
11. Unanswered Questions

Testing the blt performance between square memory regions in the graphics
subsystem only answers one set of questions. There is still much information to
be uncovered, including:

Performance of blting non-square areas

Performance of texture transfers between
up sampling, down sampling, etc.

Performance of primitive data transfers to the graphics controller, testing
type, number, format, etc.


Summary 


It is important to analyze the data usage, data layout, and data flow in
time-sensitive algorithms. Placing relevant data together can certainly complicate
data structure layout, but it can also significantly improve performance by
improving cache usage and reducing cache misses.
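The effect of layout on cache usage can be estimated by counting how many cache lines a pass over the data touches. The sketch below assumes the 32-byte cache line described earlier in the paper, an array that starts on a line boundary, and two hypothetical layouts for 2000 vertices: a compact 12-byte position (three floats) and a 68-byte vertex (17 floats) in which only the leading 12-byte position is read. The function name is hypothetical.

```cpp
#include <cstddef>
#include <set>

// Count the distinct cache lines touched when reading `fieldBytes` bytes at
// the start of each of `count` records laid out `stride` bytes apart.
// Assumption: the array begins on a cache-line boundary.
std::size_t cacheLinesTouched(std::size_t count, std::size_t stride,
                              std::size_t fieldBytes, std::size_t lineBytes)
{
    std::set<std::size_t> lines;
    if (fieldBytes == 0 || lineBytes == 0) return 0;
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t first = (i * stride) / lineBytes;
        const std::size_t last  = (i * stride + fieldBytes - 1) / lineBytes;
        for (std::size_t l = first; l <= last; ++l)
            lines.insert(l);
    }
    return lines.size();
}
```

Under these assumptions, `cacheLinesTouched(2000, 12, 12, 32)` gives 750 lines for the compact layout, while `cacheLinesTouched(2000, 68, 12, 32)` gives 2500 lines for the 68-byte layout, even though both passes read the same 24,000 bytes of position data.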


With relevant data grouped together, batch processing will, in most cases, lead
to higher performance gains than piecemeal processing.

In order to determine whether an algorithm or data structure is truly better, 
profilers that utilize the CPU performance counters, as well as accurate timing, 
should be used to collect objective data. 

Collateral 

An electronic version of this paper, presented foils, all data gathered and
analyzed, as well as the source code of both analysis programs, are available at
both the Game Developer Conference (GDC) web site at http://www.gdconf.com
and the Ensemble Studios web site at http://www.ensemblestudios.com,
under “Developer News.”

The data files for data structure analysis are relatively small; however, the data
files for the graphics subsystem data transfer analysis comprise more than 30
MB of spreadsheet data.

All data is stored in Microsoft PowerPoint 97, Microsoft Excel 97, and Microsoft
Word 97 formats.
Appendix A

The bltalyzer.exe program for analyzing performance of the graphics subsystem
supports the following parameters. Examples of the usage of
these parameters can be found in the supplied sample source code.

output=filename
    specify output filename

xres=x
    width of surfaces to allocate

yres=y
    height of surfaces to allocate

src=[mm|nlvm|lvm]
    type of memory to blt from

dest=[mm|nlvm|lvm]
    type of memory to blt to

op=[blt|bltfast]
    type of blt to perform

iter=n
    number of iterations to perform

bytes=n
    maximum number of bytes in width or height to blt

width=n
    width of starting blt

height=n
    height of starting blt

qmin=n
    minimum number of blts to queue

qmax=n
    maximum number of blts to queue
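All of these parameters share a key=value form. How bltalyzer.exe actually parses them is not shown here, so the following is only a minimal sketch of how such arguments might be split; the function name and the choice to ignore arguments without a key or an '=' sign are assumptions.

```cpp
#include <map>
#include <string>
#include <vector>

// Split each "key=value" argument into a map entry.
// Arguments with no '=' or with an empty key are ignored (an assumption;
// the real program may report them as errors instead).
std::map<std::string, std::string> parseArgs(
    const std::vector<std::string>& args)
{
    std::map<std::string, std::string> out;
    for (const std::string& a : args) {
        const std::string::size_type eq = a.find('=');
        if (eq == std::string::npos || eq == 0) continue;
        out[a.substr(0, eq)] = a.substr(eq + 1);
    }
    return out;
}
```

A caller would typically build the vector from argv[1] onward and then look up keys such as "src", "dest", or "qmax", converting numeric values as needed.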


Appendix B 


The following information was gathered from the respective graphics controllers using
the IDirectDraw::GetDeviceIdentifier function call.

3D Labs Permedia 2

driver       description
gldd32.dll   Diamond FIRE GL 1000 PRO

product  ver  subver  bld   vendor  device  subsys    rev
4        10   1       2359  4172    15623   22286482  1

Intel i740

driver        description
R3Dd32M.dll   Starfighter AGP Release 0322, Standard

product  ver  subver  bld  vendor  device  subsys  rev
4        10   1       322  32902   30720   524349  33

nVidia RIVA TNT






driver        description
NV4DD32.DLL   NVidia RIVA TNT

product  ver  subver  bld  vendor  device  subsys     rev
4        10   1       48   4318    32      659099828  4
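The vendor and device values above are printed in decimal, whereas PCI identifiers are usually written as four-digit hexadecimal numbers. The sketch below converts them; the helper name is hypothetical, and the reading of the subsys value as two packed 16-bit words (low word and high word) is an assumption rather than something the paper states.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Format the low 16 bits of a value as a four-digit hex ID, e.g. "0x8086".
// Hypothetical helper for reading the decimal values listed above.
std::string hex16(std::uint32_t value)
{
    char buf[8];
    std::snprintf(buf, sizeof(buf), "0x%04X",
                  static_cast<unsigned>(value & 0xFFFFu));
    return std::string(buf);
}
```

With this helper, the i740 vendor value 32902 becomes 0x8086 (Intel) and the RIVA TNT vendor value 4318 becomes 0x10DE (NVIDIA). For a subsys value, `hex16(subsys)` gives the low word and `hex16(subsys >> 16)` gives the high word.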


